Raw: Represents raw bytes, often used for binary data. Example: charToRaw("A")
Matrix: A 2-dimensional array where all elements must be of the same type (numeric, character, etc.). Example:
my_matrix <- matrix(data = 1:9, nrow = 3, ncol = 3)
Nested Data Structures in R
Data Frame: A table or 2-dimensional array-like structure where each column can contain different types of data (numeric, character, factor, etc.). Example:
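A minimal sketch of such a data frame, using made-up values (the names `my_df`, `name`, `age` and `group` are illustrative assumptions, not objects from the course material). Each column holds a single type, but the types differ between columns:

```r
# Made-up values, purely for illustration
my_df <- data.frame(
  name = c("Ada", "Grace", "Alan"),  # character column
  age = c(36L, 45L, 41L),            # integer column
  group = factor(c("a", "b", "a")),  # factor column
  stringsAsFactors = FALSE           # keep `name` as character in older R versions
)

# Inspect the structure and the class of each column
str(my_df)
column_classes <- sapply(my_df, class)
column_classes
```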
data_input <-"Albania is a country with 2.8 million inhabitants, its capital is Tirana, and it was founded on 28 November 1912.Andorra is a country with 77,000 inhabitants, its capital is Andorra la Vella, and it was founded on 8 September 1278.Austria is a country with 8.9 million inhabitants, its capital is Vienna, and it was founded on 12 November 1918.Belarus is a country with 9.5 million inhabitants, its capital is Minsk, and it was founded on 25 August 1991.Belgium is a country with 11.5 million inhabitants, its capital is Brussels, and it was founded on 4 October 1830.Bosnia and Herzegovina is a country with 3.3 million inhabitants, its capital is Sarajevo, and it was founded on 1 March 1992.Bulgaria is a country with 6.9 million inhabitants, its capital is Sofia, and it was founded on 22 September 1908.Croatia is a country with 4 million inhabitants, its capital is Zagreb, and it was founded on 25 June 1991.Cyprus is a country with 1.2 million inhabitants, its capital is Nicosia, and it was founded on 16 August 1960.Czech Republic is a country with 10.7 million inhabitants, its capital is Prague, and it was founded on 1 January 1993.Denmark is a country with 5.8 million inhabitants, its capital is Copenhagen, and it was founded on 5 June 1849.Estonia is a country with 1.3 million inhabitants, its capital is Tallinn, and it was founded on 20 August 1991.Finland is a country with 5.5 million inhabitants, its capital is Helsinki, and it was founded on 6 December 1917.France is a country with 67 million inhabitants, its capital is Paris, and it was founded on 22 September 1792.Germany is a country with 83 million inhabitants, its capital is Berlin, and it was founded on 3 October 1990.Greece is a country with 10.4 million inhabitants, its capital is Athens, and it was founded on 25 March 1821.Hungary is a country with 9.6 million inhabitants, its capital is Budapest, and it was founded on 23 October 1989.Iceland is a country with 366,000 inhabitants, its 
capital is Reykjavik, and it was founded on 17 June 1944.Ireland is a country with 5 million inhabitants, its capital is Dublin, and it was founded on 6 December 1922.Italy is a country with 60 million inhabitants, its capital is Rome, and it was founded on 17 March 1861.Kosovo is a country with 1.8 million inhabitants, its capital is Pristina, and it was founded on 17 February 2008.Latvia is a country with 1.9 million inhabitants, its capital is Riga, and it was founded on 18 November 1918.Liechtenstein is a country with 39,000 inhabitants, its capital is Vaduz, and it was founded on 23 January 1719.Lithuania is a country with 2.8 million inhabitants, its capital is Vilnius, and it was founded on 11 March 1990.Luxembourg is a country with 634,000 inhabitants, its capital is Luxembourg City, and it was founded on 9 June 1815.Malta is a country with 514,000 inhabitants, its capital is Valletta, and it was founded on 21 September 1964.Moldova is a country with 2.6 million inhabitants, its capital is Chișinău, and it was founded on 27 August 1991.Monaco is a country with 39,000 inhabitants, its capital is Monaco, and it was founded on 8 January 1297.Montenegro is a country with 622,000 inhabitants, its capital is Podgorica, and it was founded on 3 June 2006.Netherlands is a country with 17.4 million inhabitants, its capital is Amsterdam, and it was founded on 26 July 1581.North Macedonia is a country with 2.1 million inhabitants, its capital is Skopje, and it was founded on 8 September 1991.Norway is a country with 5.4 million inhabitants, its capital is Oslo, and it was founded on 7 June 1905.Poland is a country with 38 million inhabitants, its capital is Warsaw, and it was founded on 11 November 1918.Portugal is a country with 10.3 million inhabitants, its capital is Lisbon, and it was founded on 5 October 1143.Romania is a country with 19 million inhabitants, its capital is Bucharest, and it was founded on 1 December 1918.Russia is a country with 144 million 
inhabitants, its capital is Moscow, and it was founded on 12 June 1990.San Marino is a country with 34,000 inhabitants, its capital is San Marino, and it was founded on 3 September 301.Serbia is a country with 6.7 million inhabitants, its capital is Belgrade, and it was founded on 5 June 2006.Slovakia is a country with 5.4 million inhabitants, its capital is Bratislava, and it was founded on 1 January 1993.Slovenia is a country with 2.1 million inhabitants, its capital is Ljubljana, and it was founded on 25 June 1991.Spain is a country with 47 million inhabitants, its capital is Madrid, and it was founded on 6 December 1978.Sweden is a country with 10.4 million inhabitants, its capital is Stockholm, and it was founded on 6 June 1523.Switzerland is a country with 8.3 million inhabitants, its capital is Bern, and it was founded on 12 September 1848.Ukraine is a country with 41 million inhabitants, its capital is Kyiv, and it was founded on 24 August 1991.United Kingdom is a country with 67 million inhabitants, its capital is London, and it was founded on 1 January 1801.Vatican City is a country with 825 inhabitants, its capital is Vatican City, and it was founded on 11 February 1929."
Rows: 46
Columns: 5
$ input <chr> "Albania is a country with 2.8 million inhabitants, its ca…
$ country <chr> "Albania", "Andorra", "Austria", "Belarus", "Belgium", "Bo…
$ inhabitants <chr> "8", NA, "9", "5", "5", "3", "9", "4", "2", "7", "8", "3",…
$ capital <chr> "Tirana", "Andorra la Vella", "Vienna", "Minsk", "Brussels…
$ founded <chr> "28 November 1912", "8 September 1278", "12 November 1918"…
Replace all whitespace characters in the country column of country_df with underscores (_)
Remove all punctuation from the data_input vector
Extract the date from the string and turn it into a proper date vector:
string <- "Military defeats following the outbreak of the French Revolutionary Wars resulted in the insurrection of 10 August 1792. The monarchy was abolished and replaced by the French First Republic one month later."
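One possible approach to this exercise, sketched under the assumption that the date always follows the pattern "day month-name year" and that month names are parsed in an English locale. The regular expression and the object names `date_chr` and `date` are illustrative choices, not the course's reference solution:

```r
library(stringr)
library(lubridate)

string <- "Military defeats following the outbreak of the French Revolutionary Wars resulted in the insurrection of 10 August 1792. The monarchy was abolished and replaced by the French First Republic one month later."

# One or two digits, a capitalised word, then a four-digit year
date_chr <- str_extract(string, "\\d{1,2} [A-Z][a-z]+ \\d{4}")

# dmy() parses day-month-year strings into a Date vector
date <- dmy(date_chr)
class(date)
```

The pattern is deliberately narrow: it would miss dates written as "August 10, 1792" or with abbreviated month names, so it may need adjusting for other texts.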
Check which of the lines in data_input have the word “million” (hint: you need to split the string into a vector with str_split() first)
Now save a subset of the lines into a new object
From the following vector, I want you to write code that identifies the URLs from the German Wikipedia that do not use the secure Hypertext Transfer Protocol (HTTPS):
different types of information, clearly labelled as such
each line refers to an observation, each column contains a different variable
the variables are stored in different data types:
typeof(country_df$country)
[1] "character"
typeof(country_df$inhabitants)
[1] "integer"
typeof(country_df$founded)
[1] "double"
most statistical tools are designed to process data in tables (especially in the tidyverse)
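The "double" reported for the founded column can look surprising for a date. A short sketch in base R of why this happens: a Date is stored as a number of days relative to 1970-01-01, with a class attribute on top (the object name `founded` here is a stand-alone example value, not the column itself):

```r
founded <- as.Date("1912-11-28")

typeof(founded)   # storage type of the underlying vector
class(founded)    # class attribute that makes R treat it as a date
unclass(founded)  # the raw number of days relative to 1970-01-01
```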
country_df |> print(n = 46)
# A tibble: 46 × 4
country inhabitants capital founded
<chr> <int> <chr> <date>
1 Albania 8000000 Tirana 1912-11-28
2 Andorra 77000 Andorra la Vella 1278-09-08
3 Austria 9000000 Vienna 1918-11-12
4 Belarus 5000000 Minsk 1991-08-25
5 Belgium 5000000 Brussels 1830-10-04
6 Bosnia and Herzegovina 3000000 Sarajevo 1992-03-01
7 Bulgaria 9000000 Sofia 1908-09-22
8 Croatia 4000000 Zagreb 1991-06-25
9 Cyprus 2000000 Nicosia 1960-08-16
10 Czech Republic 7000000 Prague 1993-01-01
11 Denmark 8000000 Copenhagen 1849-06-05
12 Estonia 3000000 Tallinn 1991-08-20
13 Finland 5000000 Helsinki 1917-12-06
14 France 67000000 Paris 1792-09-22
15 Germany 83000000 Berlin 1990-10-03
16 Greece 4000000 Athens 1821-03-25
17 Hungary 6000000 Budapest 1989-10-23
18 Iceland 366000 Reykjavik 1944-06-17
19 Ireland 5000000 Dublin 1922-12-06
20 Italy 60000000 Rome 1861-03-17
21 Kosovo 8000000 Pristina 2008-02-17
22 Latvia 9000000 Riga 1918-11-18
23 Liechtenstein 39000 Vaduz 1719-01-23
24 Lithuania 8000000 Vilnius 1990-03-11
25 Luxembourg 634000 Luxembourg City 1815-06-09
26 Malta 514000 Valletta 1964-09-21
27 Moldova 6000000 Chișinău 1991-08-27
28 Monaco 39000 Monaco 1297-01-08
29 Montenegro 622000 Podgorica 2006-06-03
30 Netherlands 4000000 Amsterdam 1581-07-26
31 North Macedonia 1000000 Skopje 1991-09-08
32 Norway 4000000 Oslo 1905-06-07
33 Poland 38000000 Warsaw 1918-11-11
34 Portugal 3000000 Lisbon 1143-10-05
35 Romania 19000000 Bucharest 1918-12-01
36 Russia 144000000 Moscow 1990-06-12
37 San Marino 34000 San Marino 301-09-03
38 Serbia 7000000 Belgrade 2006-06-05
39 Slovakia 4000000 Bratislava 1993-01-01
40 Slovenia 1000000 Ljubljana 1991-06-25
41 Spain 47000000 Madrid 1978-12-06
42 Sweden 4000000 Stockholm 1523-06-06
43 Switzerland 3000000 Bern 1848-09-12
44 Ukraine 41000000 Kyiv 1991-08-24
45 United Kingdom 67000000 London 1801-01-01
46 Vatican City 825 Vatican City 1929-02-11
Country data to plots
country_df |>
  count(founded = year(founded)) |>
  arrange(founded) |>
  mutate(countries = cumsum(n)) |>
  ggplot(aes(x = founded, y = countries)) +
  geom_line() +
  labs(
    x = NULL, y = NULL,
    title = "Number of countries in Europe over time*",
    caption = "*Dissolved countries were ignored"
  )
country_df |>
  mutate(country = fct_reorder(country, inhabitants)) |>
  ggplot(aes(x = inhabitants, y = country)) +
  geom_col() +
  labs(
    x = NULL, y = NULL,
    title = "Largest countries in Europe*",
    caption = "*By inhabitants"
  )
Accessing Data
Say you want the inhabitants for row 45 (United Kingdom)
# A tibble: 1 × 4
country inhabitants capital founded
<chr> <int> <chr> <date>
1 United Kingdom 67000000 London 1801-01-01
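The code that produced this output is not shown here. A few common ways to access a row or a single value are sketched below on a small stand-in table (`mini_df` is a hypothetical three-row excerpt, so the United Kingdom sits in row 2 rather than row 45):

```r
library(dplyr)

# Hypothetical three-row stand-in for country_df
mini_df <- tibble(
  country = c("Ukraine", "United Kingdom", "Vatican City"),
  inhabitants = c(41000000L, 67000000L, 825L)
)

mini_df[2, ]            # base R: row 2, all columns
mini_df |> slice(2)     # dplyr equivalent, selecting by position
mini_df$inhabitants[2]  # only the value, as a plain vector

# Selecting by content is usually more robust than selecting by position
uk_inhabitants <- mini_df |>
  filter(country == "United Kingdom") |>
  pull(inhabitants)
uk_inhabitants
```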
Or all rows of countries older than 1300:
country_df |> filter(founded < "1300-01-01")
# A tibble: 4 × 4
country inhabitants capital founded
<chr> <int> <chr> <date>
1 Andorra 77000 Andorra la Vella 1278-09-08
2 Monaco 39000 Monaco 1297-01-08
3 Portugal 3000000 Lisbon 1143-10-05
4 San Marino 34000 San Marino 301-09-03
Updating Data
Let’s say you don’t like that the founding of France is set to the founding of the French Republic (22 September 1792) and you want to use the date of the Treaty of Verdun (10 August 843) instead:
France <- tibble(country = "France", inhabitants = 67000000L, capital = "Paris", founded = as.Date("843-08-10"))
country_df |>
  rows_update(France, by = "country") |>
  filter(founded < "1300-01-01")
# A tibble: 5 × 4
country inhabitants capital founded
<chr> <int> <chr> <date>
1 Andorra 77000 Andorra la Vella 1278-09-08
2 France 67000000 Paris 843-08-10
3 Monaco 39000 Monaco 1297-01-08
4 Portugal 3000000 Lisbon 1143-10-05
5 San Marino 34000 San Marino 301-09-03
Adding Data
What if a new country was founded in Europe today?
essland <- tibble(country = "Essex Summer School", inhabitants = 38 * 15, capital = "Colchester Campus", founded = Sys.Date())
country_df |>
  add_case(essland) |>
  arrange(desc(founded)) |>
  head()
# A tibble: 6 × 4
country inhabitants capital founded
<chr> <dbl> <chr> <date>
1 Essex Summer School 570 Colchester Campus 2024-07-17
2 Kosovo 8000000 Pristina 2008-02-17
3 Serbia 7000000 Belgrade 2006-06-05
4 Montenegro 622000 Podgorica 2006-06-03
5 Czech Republic 7000000 Prague 1993-01-01
6 Slovakia 4000000 Bratislava 1993-01-01
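Note that the inhabitants column is now shown as `<dbl>` rather than `<int>`. A minimal sketch of the likely reason, on a hypothetical one-row table: `38 * 15` is a double, and combining integer and double values yields a double column. Converting the new value to integer first keeps the column type (the names `mini_df`, `added_dbl` and `added_int` are illustrative):

```r
library(tibble)

# Hypothetical one-row stand-in for country_df
mini_df <- tibble(country = "Kosovo", inhabitants = 8000000L)

# 38 * 15 is a double, so the whole column is promoted to double
added_dbl <- mini_df |>
  add_row(country = "Essex Summer School", inhabitants = 38 * 15)

# Converting explicitly keeps the column an integer
added_int <- mini_df |>
  add_row(country = "Essex Summer School", inhabitants = as.integer(38 * 15))

typeof(added_dbl$inhabitants)
typeof(added_int$inhabitants)
```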
Deleting Data
Now say you are only interested in countries with more than one million inhabitants. You can delete superfluous data with:
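The original code for this step is not reproduced here. One way to do it, sketched on a hypothetical three-row excerpt (`mini_df` and `large_countries` are illustrative names), is to keep only the rows that satisfy the condition with `filter()`:

```r
library(dplyr)

# Hypothetical three-row stand-in for country_df
mini_df <- tibble(
  country = c("France", "Malta", "Monaco"),
  inhabitants = c(67000000L, 514000L, 39000L)
)

# Keep only rows with more than one million inhabitants;
# the other rows are dropped from the result
large_countries <- mini_df |>
  filter(inhabitants > 1000000)
large_countries
```

The original object is left unchanged; the rows are only gone for good once the result is assigned back to the same name.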
ggplot(good_table) +
  geom_point(aes(x = population, y = country, colour = as.factor(year)))
Avoid redundancy
Some variables are connected to groups and never (or almost never) change
The constant repetition introduces redundant information in our dataset
Depending on the size of our data, this can lead to increased processing time, a larger memory footprint, and more demand for storage
Conceptually, redundancy is introduced when we store information about different entities (that refer to each other) in a single table (here countries and population)
# A tibble: 2 × 2
country capital
<chr> <chr>
1 Switzerland Bern
2 Austria Vienna
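A sketch of how a redundant table can be split into one table per entity. The values are the Austria and Switzerland figures used in this section; the object names `redundant_table`, `countries` and `populations` are assumptions made for this example:

```r
library(dplyr)

# One table mixing two entities: the capital repeats for every year
redundant_table <- tibble(
  country = rep(c("Austria", "Switzerland"), each = 3),
  year = rep(c(1950, 1960, 1970), times = 2),
  population = c(6.9, 7.1, 7.5, 4.7, 5.3, 6.2),
  capital = rep(c("Vienna", "Bern"), each = 3)
)

# Entity 1: countries, one row each, holding what (almost) never changes
countries <- redundant_table |>
  distinct(country, capital)

# Entity 2: yearly population figures, linked to countries by `country`
populations <- redundant_table |>
  select(country, year, population)

# Joining the two tables on the key restores the original table
rejoined <- populations |>
  left_join(countries, by = "country")
```

Nothing is lost by splitting: the capital is stored once per country, and the `country` key is enough to bring it back when needed.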
Avoid redundancy: When not to
During data processing and management ✅
During data analysis ❌ (at least not strictly)
During analysis, you often want to compare, contrast, connect entities, so you need data about all of them
Create “analysis datasets” that contain all the data that you currently need
analysis_datasets <- merge(good_table_population, good_table_countires, by = "country")
analysis_datasets
country year population capital
1 Austria 1950 6.9 Vienna
2 Austria 1960 7.1 Vienna
3 Austria 1970 7.5 Vienna
4 Switzerland 1950 4.7 Bern
5 Switzerland 1960 5.3 Bern
6 Switzerland 1970 6.2 Bern
“Happy families are all alike; every unhappy family is unhappy in its own way.”
— Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
— Hadley Wickham
The tidyverse takes its name from the idea of tidy data, which is defined as follows (Wickham 2014):
Each variable forms a column.
Each observation forms a row.
Each value is a cell; each cell is a single value.
(Each type of observational unit forms a table.)
Any other form of data is by definition messy data. This is essentially the same definition as the one we saw above: tidy data = good table.
Identify the tidy data
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
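For comparison, a sketch of how table2 and table3 could be brought into the shape of table1. It assumes the example tables that ship with the tidyr package and tidyr 1.3.0 or later for `separate_wider_delim()`; the object names `from_table2` and `from_table3` are illustrative:

```r
library(dplyr)
library(tidyr)

# table2 stores two variables (cases, population) in one column pair:
# spread the `type` values into their own columns
from_table2 <- table2 |>
  pivot_wider(names_from = type, values_from = count)

# table3 stores two values in one cell: split `rate` at the slash
# and convert the resulting character columns to numbers
from_table3 <- table3 |>
  separate_wider_delim(rate, delim = "/", names = c("cases", "population")) |>
  mutate(across(c(cases, population), as.numeric))

from_table2
from_table3
```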
Reshaping
From wide to long:
tidy_table <- bad_table |>
  pivot_longer(
    cols = starts_with("pop"), # define which columns this applies to
    names_to = "year",         # name the new variable containing the column names
    values_to = "population"   # name the new variable containing the cell values
  ) |>
  mutate(year = as.integer(str_extract(year, "\\d+")))
As said above, this is exactly the same principle we looked at for a good table:
# A tibble: 9 × 4
family child dob name
<int> <chr> <date> <chr>
1 1 child1 1998-11-26 Susan
2 1 child2 2000-01-29 Jose
3 2 child1 1996-06-22 Mark
4 3 child1 2002-07-11 Sam
5 3 child2 2004-04-05 Seth
6 4 child1 2004-10-10 Craig
7 4 child2 2009-08-27 Khai
8 5 child1 2000-12-05 Parker
9 5 child2 2005-02-28 Gracie
pivot_wider
Not as common, but sometimes one or two columns contain several variables:
cms_patient_experience
# A tibble: 500 × 5
org_pac_id org_nm measure_cd measure_title prf_rate
<chr> <chr> <chr> <chr> <dbl>
1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 63
2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 87
3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 86
4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 57
5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 85
6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 24
7 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 59
8 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 85
9 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 83
10 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 63
# ℹ 490 more rows
In this case, we need to do the opposite operation: widening the data
# A tibble: 2 × 4
country pop1950 pop1960 pop1970
<chr> <dbl> <dbl> <dbl>
1 Switzerland 4.7 5.3 6.2
2 Austria 6.9 7.1 7.5
We have reconstructed the bad table 😓
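The call that widens the data is not shown here. A sketch of how it might look, assuming a long table with the columns country, year and population (the structure of `good_table_population` is an assumption based on the output above, and `wide_again` is an illustrative name):

```r
library(tidyr)

# Assumed long ("good") table: one row per country and year
good_table_population <- tibble(
  country = rep(c("Switzerland", "Austria"), each = 3),
  year = rep(c(1950L, 1960L, 1970L), times = 2),
  population = c(4.7, 5.3, 6.2, 6.9, 7.1, 7.5)
)

# Spread the years into columns; names_prefix recreates pop1950, pop1960, ...
wide_again <- good_table_population |>
  pivot_wider(
    names_from = year,
    values_from = population,
    names_prefix = "pop"
  )
wide_again
```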
pivot_wider: fix patient data
cms_patient_experience |>
  pivot_wider(
    id_cols = starts_with("org"), # we use both columns as id just to not widen them
    names_from = measure_cd,
    values_from = prf_rate
  )
# A tibble: 6 × 6
name age scores scores_id details details_id
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Alice 25 85 game1 123 Main St address
2 Alice 25 85 game1 Anytown city
3 Alice 25 90 game2 123 Main St address
4 Alice 25 90 game2 Anytown city
5 Alice 25 88 game3 123 Main St address
6 Alice 25 88 game3 Anytown city
# A tibble: 3 × 4
name age address city
<chr> <dbl> <chr> <chr>
1 Alice 25 123 Main St Anytown
2 Alice 25 123 Main St Anytown
3 Alice 25 123 Main St Anytown
games <- sports_results_df |>
  mutate(game = str_extract(scores_id, "\\d+")) |>
  select(name, scores, game)
games
# A tibble: 6 × 3
name scores game
<chr> <dbl> <chr>
1 Alice 85 1
2 Alice 85 1
3 Alice 90 2
4 Alice 90 2
5 Alice 88 3
6 Alice 88 3
Since we already know that the data contains the latest Tweets, we can use a function I wrote to search for it in the data:
parse_path <- function(ix) {
  out <- as.list(ix$p)
  out[which(ix$p == as.character(ix$pos))] <- ix$pos[ix$p == as.character(ix$pos)]
  gsub("list(", "purrr::pluck(DATA, ", deparse1(out), fixed = TRUE)
}

#' Search a list
#'
#' @param l a list
#' @param f a function to identify the element you are searching
#'
#' @return an object containing the searched element with the function to extract it as a name
#' @export
list_search <- function(l, f) {
  paths <- rrapply::rrapply(
    object = l,
    condition = f,
    f = function(x, .xparents, .xname, .xpos) list(p = .xparents, n = .xname, pos = .xpos),
    how = "flatten"
  )
  out <- purrr::map(paths, function(p) purrr::pluck(l, !!!p$pos))
  names(out) <- purrr::map_chr(paths, parse_path)
  return(out)
}
list_search(ess_tweets, function(x) str_detect(x, "Only one week to go until all our session three courses close!"))
$`purrr::pluck(DATA, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries", 2L, "content", "itemContent", "tweet_results", "result", "legacy", "full_text")`
[1] "RT @EssexSumSchool: Only one week to go until all our session three courses close! \n\nMost of these courses take place from 5 – 16 August an…"
$`purrr::pluck(DATA, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries", 2L, "content", "itemContent", "tweet_results", "result", "legacy", "retweeted_status_result", "result", "legacy", "full_text")`
[1] "Only one week to go until all our session three courses close! \n\nMost of these courses take place from 5 – 16 August and there are 11 courses available including machine learning for different types of data and agent-based models.\n\nFind out more: https://t.co/ThWnkYp0ye\n#ESS2024 https://t.co/fsve1tEdhd"
$`purrr::pluck(DATA, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries", 3L, "content", "itemContent", "tweet_results", "result", "legacy", "full_text")`
[1] "Only one week to go until all our session three courses close! \n\nMost of these courses take place from 5 – 16 August and there are 11 courses available including machine learning for different types of data and agent-based models.\n\nFind out more: https://t.co/ThWnkYp0ye\n#ESS2024 https://t.co/fsve1tEdhd"
Twitter data case study: getting the right doll
Going one level up from this tweet might contain more useful information:
Looks like this is the entire tweet. Additionally, the part of the path called “entries” might suggest we can just extract all tweets relatively easily:
It seems all of these entries are structured in the same way, which is a good sign. So hopefully we can just extract the content from the same position in all of them:
tweets <- map(entries, function(x) pluck(x, "content", "itemContent", "tweet_results", "result", "legacy"))
# alternatively, this produces the same outcome
# tweets <- map(entries, c("content", "itemContent", "tweet_results", "result", "legacy"))
lobstr::tree(tweets, max_depth = 2, max_length = 25)
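To see why the two variants are equivalent, here is a small sketch on a made-up nested list. The `entries` object below is a simplified stand-in with a shorter path than the real Twitter data, and the field values are invented:

```r
library(purrr)

# Made-up stand-in: every entry is nested in the same way
entries <- list(
  list(content = list(itemContent = list(legacy = list(full_text = "first tweet", favorite_count = 3L)))),
  list(content = list(itemContent = list(legacy = list(full_text = "second tweet", favorite_count = 7L))))
)

# Explicit version: pluck() walks down the path inside each entry
via_pluck <- map(entries, function(x) pluck(x, "content", "itemContent", "legacy"))

# Shortcut: a character vector passed to map() is used as a path into each element
via_shortcut <- map(entries, c("content", "itemContent", "legacy"))

identical(via_pluck, via_shortcut)

# Once every element has the same shape, single fields are easy to pull out
texts <- map_chr(via_pluck, "full_text")
texts
```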
The object below is a nested list with Game of Thrones characters. Tidy the data into a data.frame that contains at least name, gender, culture and birthday of each character
got_characters <- repurrrsive::got_chars
From the got_characters object, also extract in which season(s) they appear, so that the data frame contains one row per season appearance (and still the same character information as above)
Sometimes, it makes sense to keep nested information around for later. The got_characters object contains aliases for the characters. Store them along the character name in a nested format, so that each row looks like this:
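The general technique behind these exercises can be sketched on a tiny made-up list rather than on the real data. The names `characters`, `raw`, `seasons` and the values below are hypothetical; the actual field names in got_characters may differ:

```r
library(tidyr)

# Made-up nested list with the same general shape: one element per character
characters <- list(
  list(name = "A", gender = "Female", aliases = c("x", "y"), seasons = c("Season 1", "Season 2")),
  list(name = "B", gender = "Male", aliases = "z", seasons = "Season 1")
)

# Put the list into a list-column, then turn each element's fields into columns.
# Fields with several values per character (aliases, seasons) stay nested.
character_df <- tibble(raw = characters) |>
  unnest_wider(raw)

# One row per season appearance; the aliases list-column is kept for later
appearances <- character_df |>
  unnest_longer(seasons)

character_df
appearances
```

Keeping `aliases` as a list-column means each row still carries all aliases of a character, and they can be unnested later only if an analysis actually needs them.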
References
Atteveldt, Wouter van, Damian Trilling, and Carlos Arcíla. 2021. Computational Analysis of Communication: A Practical Introduction to the Analysis of Texts, Networks, and Images with Code Examples in Python and R. Hoboken, NJ: John Wiley & Sons. https://cssbook.net.
Weidmann, Nils B. 2023. Data Management for Social Scientists: From Files to Databases. 1st ed. Cambridge University Press. https://doi.org/10.1017/9781108990424.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly.